Whole Exome genomic analysis of NRG-GY003 ovarian cancer tumors
Genomic analysis of tumor exomes from patients with recurrent ovarian cancer treated with immune checkpoint inhibitors (NIVO or IPI/NIVO) treated on the NRG-GY003 clinical trial.
Author
Affiliation
Kelsey Monson, PhD, MS
Icahn School of Medicine at Mount Sinai
Published
August 1, 2025
Keywords
Genomics, Whole Exome, Immunotherapy, R, Ovarian Cancer
Intro
Variant Filtering Strategy
I first used the Nextflow Sarek pipeline to align the raw fastq files. Below is the workflow I used:
Note that, at this time, I have only merged the variants from the SNP/indel callers, as ASCAT and the other SV/CNV callers do not have .vcfs as their output.
Here is the detailed variant filtering strategy I used to come up with the MAF file used in this analysis:
First find consensus somatic calls:
Present in >=2 somatic callers (Freebayes, Mutect2, Strelka2)
Include variants annotated with PASS
PASS: indicates the variant passed all quality filters applied by the variant caller
Considered the gold-standard for high-quality variants passing the variant caller’s built-in quality control filters
Further refine PASS variants
Minimum tumor Read Depth = 20: ensures sufficient coverage (i.e. evidence) to confidently call tumor variants from normal and/or distinguish from random sequencing errors
Allele frequency 0.05 < AF < 0.95:
>0.05: also helps avoid random sequencing errors/miss-called variants
<0.95: helps avoid germline contamination
Identify rare pathogenic germline variants:
Present in >=2 germline callers (Freebayes, Haplotypecaller, Strelka2)
Identify those most likely to be pathogenic
Limit to protein-coding variants (drop all intergenic and non-coding variants)
Must be annotated with HIGH impact
Eliminate common variants
“Rare” variants are typically those present in <1% of the population
Because I was worried about missing variants, I initially excluded those with >5% population frequency, but this was too liberal.
I revised the script to set the threshold to 1% and will re-generate the MAF file eventually
For the current analysis, I QC’d the variants and only identified likey oncogenic variants in BRCA1/2, so limited the germline analysis to those genes.
Include all likely oncogenes regardless of population frequency (“BRCA1” “BRCA2” “TP53” “PTEN” “ATM” “CHEK2” “PALB2” “APC” “MUTYH”)
Generated consensus MAF file
The above analysis generated two vcfs, one with the processed somatic variants and one with likely pathogenic germline variants
These variants were annotated as SOMATIC or GERMLINE_PATHOGENIC, respectively in the vcfs
I then concatenated these variants to one MAF file for the entire sample using the Nextflow vcf2maf pipeline.
This consensus MAF is what is used for the downstream analysis presented here.
Loading in the data
Packages Used
# Load packageslibrary(maftools) # For majority of maf file analysislibrary(mclust)library("BSgenome.Hsapiens.UCSC.hg38", quietly =TRUE)library(NMF) # For signature calculationlibrary(pheatmap) # For nice heatmapslibrary(ghibli) # For pretty colorslibrary(gt) # For nice tables (maybe?)
Load in raw MAF file and the clinical data
laml =read.maf(maf="input/merged_consensus_all_samples_germline.maf",clinicalData ="input/NRG-GY00_laml_annot_2.csv",rmFlags =TRUE# "FLAGS, frequently mutated genes in public exomes" )
-Reading
-Validating
-Removing 20 FLAG genes
-Silent variants: 30668
-Summarizing
-Processing clinical data
-Finished in 3.380s elapsed (3.110s cpu)
Tip
Setting rmFlags = TRUE removes frequently mutated genes in public exomes
Data cleaning: setting levels for factor variables
# RECIST response (CR, PR, SD, PD)laml@clinical.data$response <-factor(laml@clinical.data$response, levels=c("CR","PR","SD","PD"))#table(laml@clinical.data$response)# Detailed response including durable and progressive SDlaml@clinical.data$response2 <-factor(laml@clinical.data$response2, levels=c("CR","PR","SD_CB","SD_NCB","PD"))# table(laml@clinical.data$response2)# Response by treatmentlaml@clinical.data$response_tx <-factor(laml@clinical.data$response_tx, levels=c("Nivo_CB","Combo_CB","Nivo_NCB","Combo_NCB"))#table(laml@clinical.data$response_tx)# Race (White, Non-White, and Not Reported)laml@clinical.data$race2 <-factor(laml@clinical.data$race2, levels=c("W","NW","NR"))#table(laml@clinical.data$race2)# Site (dichotomous)laml@clinical.data$Site2 <-factor(laml@clinical.data$Site2, levels=c("Adnexa","Non_Adnexa"))#table(laml@clinical.data$Site2)# Primary/Metlaml@clinical.data$Primary_Met <-factor(laml@clinical.data$Primary_Met, levels=c("P","M"))#table(laml@clinical.data$Primary_Met)
Subsetting datasets
We have some normal samples with no matched tumor, and we may be interested in doing subtype-specific analysis.
Let’s subset to only tumor samples, and then create further subsetted datasets based on clinical characteristics.
Only tumor samples
I also annotated with pathogenic germline variants; the only relevant ones were in BRCA1 and BRCA2, so we are subsetting to somatic mutations and pathogenic germline variants.
# Subsetting the other dataset to "not serous"`%ni%`<-Negate(`%in%`)laml_other =subsetMaf(maf = laml_tum, clinQuery ="celltype %ni% 'Serous_Adenocarcinoma'")
## Subsetting by treatment to see if there are differences### There shouldn't be any, since treatment assignment was randomized, but is is good to confirm.laml_nivo =subsetMaf(maf = laml_tum, clinQuery ="tx %in% 'Nivo'")
-subsetting by clinical data..
--43 samples meet the clinical query
-Processing clinical data
View the MAF file summary
Here is a basic summary of the MAF file
laml_tum
An object of class MAF
ID summary Mean Median
<char> <char> <num> <num>
1: NCBI_Build GRCh38 NA NA
2: Center . NA NA
3: Samples 86 NA NA
4: nGenes 6897 NA NA
5: Frame_Shift_Del 234 2.721 2.0
6: Frame_Shift_Ins 43 0.500 0.0
7: In_Frame_Del 83 0.965 0.0
8: In_Frame_Ins 6 0.070 0.0
9: Missense_Mutation 9193 106.895 81.0
10: Nonsense_Mutation 524 6.093 4.0
11: Nonstop_Mutation 13 0.151 0.0
12: Splice_Site 505 5.872 3.0
13: Translation_Start_Site 19 0.221 0.0
14: total 10620 123.488 92.5
These are some more summaries you can generate
# I'm commenting them out as they have too much info for the tables to be readable.# #Shows sample summary.# getSampleSummary(laml_tum)# #Shows gene summary.# getGeneSummary(laml_tum)# #shows clinical data associated with samples# getClinicalData(laml_tum)# #Shows all fields in MAF# getFields(laml_tum)
Visual Summaries
Summarize the maf file
We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.
#Use mafbarplot for a minimal barplot of mutated genes.mafbarplot(maf = laml_tum)
Summarize the other subsets
Here are the mutation distributions for the other subsets.
As a sanity check, the majority of Serous samples have TP53 mutations, while there are few TP53 mutations in the top 10 for the non-Serous samples (as is expected).
Oncoprints (or “oncoplots” as the wrapper function in maftools is called) summarize complex genomic features of a given dataset.
#oncoplot for top ten mutated genes.oncoplot(maf = laml_tum, top =10)
These are the key features and how they are interpreted:
Columns represent individual patients.
Reading column-wise, you can see the collection of alterations in a patient’s tumor for the given set of genes.
The bar plot on the top is a count of tumor mutation burden per patient, color-coded by mutation type.
Rows are altered genes.
Alterations are color-coded to indicate their type (e.g. mutation, deletion, insertion)
Genes can be further annotated with key features (e.g. high impact/likely pathogenic mutations, germline variants, etc.)
The bar plot on the right summarizes alterations in a given gene for the entire dataset.
Many more features and annotations can be added to further characterize the dataset.
#KelseyNote
Something to be aware of, especially as there are a number of mucins that are coming up, is that some samples were tumor-only and some were matched normal; would be good to annotate the metadata with this info and see if the mucins are coming up more in the tumor-only samples, since these genes tend to be highly polymorphic.
Also re: mucins, MUC16 is one of the genes that’s filtered out when using the FLAGS filter (it’s #2 on the list of top mutated genes). If we include it, it’s the 13th most mutated gene, but there are only 9 patients who have mutations.
Questions
Do we care about MUC16 mutations?
Chatted with Yanis about this, we might be interested in this, at least in relation to his project for where the mutations are coming from (if they’re all in the cleaved region, we don’t care so much, but if they’re in the membrane-bound region, this could be a problem). He will see where his antibody binds, and I will work on making a lollipop plot in our data. Based on cBioPortal, all the ovarian cancer mutations are in the extracellular region, not the transmembrane or cytoplasmic regions (which is good).
Should we be excluding the other mucins from the analysis? (And does it matter beyond individual gene-level analysis? It could affect multiple testing adjustments.)
# Highlight by specific attribute in MAF fileoncoplot(maf = laml_tum, top =10,additionalFeature =c("IMPACT", "HIGH"))
# Add transitions/transversions oncoplot(maf = laml_tum, top =10, draw_titv =TRUE)
sigpw plots enrichment for known oncogenic signaling pathways as defined in TCGA, Sanchez/Vega et al.
# Plot genes in 2 top oncogenic signaling pathwaysoncoplot(maf = laml_tum, pathways ="sigpw", gene_mar =8, fontSize =0.6, topPathways =2)
# Collapse to just plot the pathways, now top 5oncoplot(maf = laml_tum, pathways ="sigpw", gene_mar =8, fontSize =0.6, topPathways =5, collapsePathway =TRUE)
# Top 10 pathways*# *I've set `topPathways` = 10 but this is 24 pathways, not sure what's going on there...oncoplot(maf = laml_tum, pathways ="smgbp", gene_mar =8, fontSize =0.6, topPathways =10, collapsePathway =TRUE)
I need to better understand what specifically is being plotted here; based on cBioPortal, and because this is the protein that’s being shown, it’s only the exons, but I need to better understand these functional regions that are being highlighted.
Setting detectChangePoints = TRUE detects genomic change points where potential kataegis are found.
Kataegis are genomic segments containing six or more consecutive mutations with an average inter-mutation distance of <=1,000 bp.
More on kataegis
From the maftools help page for rainfallPlot():
Kategis detection algorithm by Moritz Goretzky at WWU Munster, which exploits the definition of Kategis (six consecutive mutations with an avg. distance of 1000 bp ) to identify hyper mutated genomic loci.
Algorithm starts with a double-ended queue to which six consecutive mutations are added and their average inter-mutation distance is calculated.
If the average inter-mutation distance is larger than 1000, one element is added at the back of the queue and one is removed from the front.
If the average inter-mutation distance is less or equal to 1000, further mutations are added until the average inter-mutation distance is larger than 1000.
After that, all mutations in the double-ended queue are written into output as one kataegis and the double-ended queue is reinitialized with six mutations.
This is much better than when I used the Genewiz analysis (where they likely used a common reference genome and called everything outside that a somatic mutation), but this still seems a bit high compared to the ovarian cohort.
I know most of the TCGA samples are primaries, but I guess most of these are too, so it should be fairly comparable. Need to consider how much of a problem this is (or isn’t).
Analysis
Somatic interactions
Mutually exclusive or co-occurring set of genes can be detected using the somaticInteractions function, which performs pair-wise Fisher’s Exact test to detect such significant pair of genes.
# Exclusive/co-occurance event analysis on top 10 mutated genes## Full cohortsomaticInteractions(maf = laml_tum, top =25, pvalue =c(0.05, 0.1))
Something to be cautious of, especially in the CB/NCB analysis, is that the cohort is enriched for Serous histology but that may not be evenly distributed between CB/NCB groups (should test this).
So we could be picking up on histological differences that we mistake for real response signals.
Pfam domains
The pfamDomains function performs domain enrichment analysis, grouping protein domains to identify the most dysregulated pathways and protein families involved in similar functions.
pfamDomains summarizes amino acid positions and annotates them with pfam domain info.
It returns two tables summarized by 1) amino acid positions and 2) domains.
It also plots the top most mutated domains as a scatter plot.
# Scatter plot with the top 10 mutated domainslaml.pfam =pfamDomains(maf = laml_tum, top =10)
First immunoglobulin (Ig)-like domain found in Leukocyte Ig-like receptors (LILR)B1 (also known as LIR-1) and similar proteins
PPIAL4D
108
Missense_Mutation
4
4
1.00000000
cyclophilin_ABH_like
cd01926
cyclophilin_ABH_like: Cyclophilin A, B and H-like cyclophilin-type peptidylprolyl cis- trans isomerase (PPIase) domain. This family represents the archetypal cystolic cyclophilin similar to human cyclophilins A, B and H. PPIase is an enzyme which...
CYP2D6
486
Missense_Mutation
3
6
0.50000000
p450
pfam00067
Cytochrome P450
DND1
68
Missense_Mutation
3
3
1.00000000
RRM
cd00590
RRM (RNA recognition motif), also known as RBD (RNA binding domain) or RNP (ribonucleoprotein domain), is a highly abundant domain in eukaryotes found in proteins involved in post-transcriptional gene expression processes including mRNA and rRNA...
IGF2BP3
503
Missense_Mutation
3
4
0.75000000
KH-I
cd00105
K homology RNA-binding domain, type I. KH binds single-stranded RNA or DNA. It is found in a wide variety of proteins including ribosomal proteins, transcription factors and post-transcriptional modifiers of mRNA. There are two different KH domains that...
One of three types of internal repeats found in the plasma protein fibronectin. Its tenth fibronectin type III repeat contains an RGD cell recognition sequence in a flexible loop between 2 strands. Approximately 2% of all...
Fibronectin type 3 domain
P53
53
1
cd08367
P53 DNA-binding domain
COG2319
52
43
NA
FOG: WD40 repeat [General function prediction only]
Here are the results stratified by histology
# Serouslaml.pfam.ser =pfamDomains(maf = laml_ser, top =10)
# Non-Serouslaml.pfam.other =pfamDomains(maf = laml_other, top =10)
Here they are stratified by response
# CBlaml.pfam.CB =pfamDomains(maf = laml_CB, top =10)
typically contains a GH dipeptide 11-24 residues from...
WD40 domain, found in a number of eukaryotic proteins that cover a wide variety of functions including adaptor/regulatory modules in signal transduction, pre-mRNA processing and cytoskeleton assembly
COG2319
19
18
NA
FOG: WD40 repeat [General function prediction only]
FN3
19
18
One of three types of internal repeats found in the plasma protein fibronectin. Its tenth fibronectin type III repeat contains an RGD cell recognition sequence in a flexible loop between 2 strands. Approximately 2% of all...
Fibronectin type 3 domain
P53
18
1
cd08367
P53 DNA-binding domain
LamG
15
14
Laminin G-like domains are usually Ca++ mediated receptors that can have binding sites for steroids, beta1 integrins, heparin, sulfatides, fibulin-1, and alpha-dystroglycans. Proteins that contain LamG domains serve a variety of...
Laminin G domain
ANK
14
13
ankyrin repeats mediate protein-protein interactions in very diverse families of proteins. The number of ANK repeats in a protein can range from 2 to over 20 (ankyrins, for example). ANK repeats may occur in combinations with other...
ankyrin repeats
I-set
14
13
pfam07679
Immunoglobulin I-set domain
# NCBlaml.pfam.NCB =pfamDomains(maf = laml_NCB, top =10)
One of three types of internal repeats found in the plasma protein fibronectin. Its tenth fibronectin type III repeat contains an RGD cell recognition sequence in a flexible loop between 2 strands. Approximately 2% of all...
Fibronectin type 3 domain
P53
35
1
cd08367
P53 DNA-binding domain
COG2319
33
27
NA
FOG: WD40 repeat [General function prediction only]
I-set
26
22
pfam07679
Immunoglobulin I-set domain
S_TKc
24
21
smart00220
Serine/Threonine protein kinases, catalytic domain
Ig
23
18
cl11960
Immunoglobulin domain
ANK
21
18
ankyrin repeats mediate protein-protein interactions in very diverse families of proteins. The number of ANK repeats in a protein can range from 2 to over 20 (ankyrins, for example). ANK repeats may occur in combinations with other...
ankyrin repeats
#KelseyNote
The domains are largely similar between CB and NCB, but CB has WD40 in the top 4, but it’s #20 for NCB.
LamG is also only in CB top 10 (#13 for NCB), while S_TKc and Ig are in NCB top 10 (Ig is #19 for CB; S_TKc not in CB top 20).
Again, not sure how meaningful this kind of analysis is.
Survival Analysis
#Survival analysis based on grouping of TP53 mutation statusmafSurvival(maf = laml_tum, genes ='TP53', time ='Survtime', Status ='Survstat')
TP53
59
Group medianTime N
<char> <num> <int>
1: Mutant 26.84 59
2: WT 14.19 27
# This...looks wrong. I don't think the events are being recognized correctly.